Scalable Protein Sequence Similarity Search using Locality-Sensitive Hashing and MapReduce

Abstract

Metagenomics is the study of environments through genetic sampling of their microbiota. Metagenomic studies produce large datasets that are estimated to grow at a faster rate than the available computational capacity. A key step in the study of metagenome data is sequence similarity searching, which is computationally intensive over large datasets. Tools such as BLAST require large dedicated computing infrastructure to perform such analysis, and this infrastructure may not be available to every researcher. In this paper, we propose a novel approach called ScalLoPS that searches protein sequence datasets using LSH (Locality-Sensitive Hashing) implemented on the MapReduce distributed framework. ScalLoPS is designed to scale across computing resources sourced from cloud computing providers. We present the design and implementation of ScalLoPS, followed by an evaluation with datasets derived from both traditional and metagenomic studies. Our experiments show that this method approximates the quality of BLAST results while improving the scalability of protein sequence search.
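The abstract does not say which hash family ScalLoPS uses or how its MapReduce jobs are laid out. As a rough illustration of the general idea only, the sketch below assumes MinHash signatures over protein k-mers with banding (a common LSH scheme) and simulates the map and reduce phases in a single process. The names `kmers`, `minhash_signature`, `map_phase`, `reduce_phase`, and `candidate_pairs`, along with the parameters `K`, `NUM_HASHES`, and `BANDS`, are hypothetical and not taken from the paper.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

# All parameters below are illustrative assumptions, not values from the paper.
K = 3            # k-mer (shingle) length
NUM_HASHES = 32  # length of the MinHash signature
BANDS = 16       # number of LSH bands; each band has NUM_HASHES // BANDS rows


def kmers(seq, k=K):
    """Return the set of overlapping k-mers in a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(seq):
    """MinHash signature: the minimum hash of the k-mer set under each seed."""
    shingles = kmers(seq)
    return [min(_hash(seed, s) for s in shingles) for seed in range(NUM_HASHES)]


def map_phase(seq_id, seq):
    """Map step: emit one ((band index, band values), seq_id) pair per band."""
    if len(seq) < K:
        return  # too short to produce any k-mer; emit nothing
    sig = minhash_signature(seq)
    rows = NUM_HASHES // BANDS
    for b in range(BANDS):
        yield (b, tuple(sig[b * rows:(b + 1) * rows])), seq_id


def reduce_phase(key, seq_ids):
    """Reduce step: sequences sharing a bucket become candidate pairs."""
    for a, b in combinations(sorted(set(seq_ids)), 2):
        yield a, b


def candidate_pairs(sequences):
    """Simulate shuffle and reduce locally over a dict of id -> sequence."""
    buckets = defaultdict(list)
    for seq_id, seq in sequences.items():
        for key, value in map_phase(seq_id, seq):
            buckets[key].append(value)
    pairs = set()
    for key, ids in buckets.items():
        pairs.update(reduce_phase(key, ids))
    return pairs
```

Calling `candidate_pairs({"a": "MKTAYIAKQR", "b": "MKTAYIAKQR", "c": "WWWWWWWWWW"})` groups the two identical sequences into the same buckets, so `("a", "b")` is reported as a candidate pair. In a real deployment the candidate pairs would still need to be scored with an alignment or similarity measure, and the bucketing would run as distributed map and reduce tasks rather than in one process.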